What scalability problems have you encountered using a NoSQL data store? [closed]
NoSQL refers to non-relational data stores that break with the history of relational databases and their ACID guarantees. Popular open-source NoSQL data stores include:
- Cassandra (tabular, written in Java, used by Cisco, WebEx, Digg, Facebook, IBM, Mahalo, Rackspace, Reddit and Twitter)
- CouchDB (document, written in Erlang, used by BBC and Engine Yard)
- Dynomite (key-value, written in Erlang, used by Powerset)
- HBase (key-value, written in Java, used by Bing)
- Hypertable (tabular, written in C++, used by Baidu)
- Kai (key-value, written in Erlang)
- MemcacheDB (key-value, written in C, used by Reddit)
- MongoDB (document, written in C++, used by Electronic Arts, Github, NY Times and Sourceforge)
- Neo4j (graph, written in Java, used by some Swedish universities)
- Project Voldemort (key-value, written in Java, used by LinkedIn)
- Redis (key-value, written in C, used by Craigslist, Engine Yard and Github)
- Riak (key-value, written in Erlang, used by Comcast and Mochi Media)
- Ringo (key-value, written in Erlang, used by Nokia)
- Scalaris (key-value, written in Erlang, used by OnScale)
- Terrastore (document, written in Java)
- ThruDB (document, written in C++, used by JunkDepot.com)
- Tokyo Cabinet / Tokyo Tyrant (key-value, written in C, used by Mixi.jp (a Japanese social networking site))
I'd like to know about specific problems that you, the SO reader, have solved using data stores, and which NoSQL data store you used.

Questions:

- What scalability problems have you solved using a NoSQL data store?
- What NoSQL data store did you use?
- What database did you use before switching to a NoSQL data store?

I'm looking for first-hand experiences, so please don't answer unless you have them.
I've switched a small subproject from MySQL to CouchDB to be able to handle the load, and the result was amazing.

About two years ago, we released self-written software on http://www.ubuntuusers.de/ (which is probably the biggest German Linux community website). The site is written in Python, and we added a WSGI middleware which was able to catch all exceptions and send them to another small MySQL-powered website. This small website used a hash to determine different bugs and stored the number of occurrences and the last occurrence as well.
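A minimal sketch of what such a middleware could look like (the hashing scheme and the report_exception hook are illustrative assumptions, not our actual code):

import hashlib
import traceback

class TracebackLoggerMiddleware(object):
    """Catches unhandled exceptions and reports them to an external logger."""
    def __init__(self, app, report_exception):
        self.app = app
        self.report_exception = report_exception  # e.g. an HTTP POST to the logger site

    def __call__(self, environ, start_response):
        try:
            return self.app(environ, start_response)
        except Exception:
            tb = traceback.format_exc()
            # Hash the traceback text so identical bugs collapse into one record;
            # the logger site bumps the occurrence count and last-seen timestamp.
            bug_hash = hashlib.md5(tb.encode("utf-8")).hexdigest()
            self.report_exception(bug_hash, tb)
            raise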
Unfortunately, shortly after the release, the traceback-logger website wasn't responding anymore. We had some locking problems in the production DB of the main site which were throwing exceptions on nearly every request, as well as several other bugs which we hadn't explored during the testing stage. The server cluster of the main site was calling the traceback-logger submit page several times per second, and that was way too much for the small server that hosted the traceback logger (it was an old server which had only been used for development purposes).

At this time CouchDB was rather popular, so I decided to try it out and write a small traceback logger with it. The new logger consisted of just a single Python file, which provided a bug list with sorting and filter options plus a submit page. And in the background I started a CouchDB process. The new software responded extremely quickly to all requests, and we were able to view the massive amount of automatic bug reports.

One interesting thing is that the previous solution was running on an old dedicated server, while the new CouchDB-based site was only running on a shared Xen instance with very limited resources. And I hadn't even used the strength of key-value stores to scale horizontally; the ability of CouchDB / Erlang OTP to handle concurrent requests without locking anything was already enough to meet the needs.

Now the quickly-written CouchDB traceback logger is still running and is a helpful way to explore bugs on the main website. Anyway, about once a month the database becomes too big and the CouchDB process gets killed. But then CouchDB's compact-db command reduces the size from several GBs back to some KBs, and the database is up and running again (maybe I should consider adding a cronjob there... 0o).
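For reference, compaction can be triggered over CouchDB's HTTP API, so the cronjob would boil down to something like this (the database name is made up; add credentials if your CouchDB requires them):

import requests

# POST /{db}/_compact asks CouchDB to compact the database in the background;
# the server replies 202 Accepted immediately.
resp = requests.post(
    "http://localhost:5984/tracebacks/_compact",
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()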
In summary, CouchDB was surely the best choice (or at least a better choice than MySQL) for this subproject, and it does its job well.
This is my current project, actually.

Storing 18,000 objects in a normalized structure: 90,000 rows across 8 different tables. It took 1 minute to retrieve them and map them to our Java object model, with everything correctly indexed.

Storing them as key/value pairs using a lightweight text representation: 1 table, 18,000 rows, 3 seconds to retrieve them all and reconstruct the Java objects.

In business terms: the first option was not feasible. The second option means the app works.
Technical details: MySQL for both the SQL and the NoSQL versions! We're sticking with MySQL because of its proven track record: good transaction support, performance, not corrupting data, good scalability, clustering support, and so on.

Our data model in MySQL is now just a key field (integer) and a big "value" field: basically one big TEXT column.
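Roughly, the schema and access pattern look like this (sketched in Python for brevity even though our app is Java; the table, column and serialization details are illustrative assumptions):

import mysql.connector  # assumes MySQL Connector/Python

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="app")
cur = conn.cursor()

# One integer key plus one big TEXT blob; no other structure in SQL.
cur.execute("""
    CREATE TABLE IF NOT EXISTS kv_store (
        id    INT PRIMARY KEY,
        value LONGTEXT NOT NULL
    )
""")

# Store and fetch a serialized object (the text format is our own).
cur.execute("REPLACE INTO kv_store (id, value) VALUES (%s, %s)",
            (42, "Name=An Example Product\nType=CategoryAProduct\n..."))
conn.commit()
cur.execute("SELECT value FROM kv_store WHERE id = %s", (42,))
text = cur.fetchone()[0]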
We did not go with any of the new players (CouchDB, Cassandra, MongoDB, etc.) because, although each offers great features/performance in its own right, there were always drawbacks for our circumstances (e.g. missing/immature Java support).

An additional benefit of (ab)using MySQL: the bits of our model that do work relationally can easily be linked to our key/value store data.
Update: here's an example of how we represented text content. It's not our actual business domain (we don't work with "products", and my boss would shoot me), but it conveys the idea, including the recursive aspect (one entity, here a product, "containing" others). Hopefully it's clear how, in a normalised structure, this could be quite a few tables: e.g. joining a product to its range of flavours, which other products it contains, etc.
Name=An Example Product
Type=CategoryAProduct
Colour=Blue
Size=Large
Flavours={nice,lovely,unpleasant,foul}
Contains=[
    Name=Product2
    Type=CategoryBProduct
    Size=medium
    Flavours={yuck}
    ------
    Name=Product3
    Type=CategoryCProduct
    Size=Small
    Flavours={sublime}
]
Todd Hoff's highscalability.com has a lot of great coverage of NoSQL, including some case studies.
The commercial Vertica columnar DBMS might suit your purposes (even though it supports SQL): it's very fast compared with traditional relational DBMSs for analytics queries. See Stonebraker, et al.'s recent CACM paper contrasting Vertica with map-reduce.
Update: And Twitter's selected Cassandra over several others, including HBase, Voldemort, MongoDB, MemcacheDB, Redis, and HyperTable.
Update 2: Rick Cattell has just published a comparison of several NoSQL systems in High Performance Data Stores. And highscalability.com's take on Rick's paper is here.
We moved part of our data from MySQL to MongoDB, not so much for scalability but more because it is a better fit for files and non-tabular data.
In production we currently store:
- 25 thousand files (60GB)
- 130 million other "documents" (350GB)
with a daily turnover of around 10GB.
The database is deployed in a "paired" configuration on two nodes (6x450GB SAS RAID10), with apache/wsgi/python clients using the MongoDB Python API (pymongo). The disk setup is probably overkill, but that's what we use for MySQL.

Apart from some issues with pymongo threadpools and the blocking nature of the mongodb server, it has been a good experience.
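A minimal sketch of the client side with a current pymongo (collection, field and file names are invented for illustration; files go through GridFS):

import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["mydb"]

# Non-tabular "documents" go straight into a collection, no schema needed.
db.documents.insert_one({"source": "feed-x", "payload": {"any": "shape"}})

# Files are stored via GridFS, which chunks them inside MongoDB.
fs = gridfs.GridFS(db)
with open("report.pdf", "rb") as f:
    file_id = fs.put(f, filename="report.pdf")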
I apologize for going against your bold text, since I don't have any first-hand experience, but this set of blog posts is a good example of solving a problem with CouchDB.
Essentially, the textme application used CouchDB to deal with their exploding data problem. They found that SQL was too slow to deal with large amounts of archival data, and moved it over to CouchDB. It's an excellent read, and he discusses the entire process of figuring out what problems CouchDB could solve and how they ended up solving them.
We've moved some of the data we used to store in PostgreSQL and Memcached into Redis. Key-value stores are much better suited for storing hierarchical object data. You can store blob data much faster, and with much less development time and effort, than using an ORM to map your blobs to an RDBMS.
I have an open-source C# Redis client that lets you store and retrieve any POCO objects with one line:
var customers = redis.Lists["customers"]; //Implements IList<Customer>
customers.Add(new Customer { Name = "Mr Customer" });
Key-value stores are also much easier to 'scale out', as you can add a new server and then partition your load evenly to include it. Importantly, there is no central server that will limit your scalability (though you will still need a strategy such as consistent hashing to distribute your requests).
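A toy illustration of the consistent-hashing idea (a real client would use a hardened ring implementation such as ketama; everything below is a sketch):

import bisect
import hashlib

class HashRing(object):
    """Toy hash ring: a key maps to the nearest server clockwise, so adding
    a server only remaps a fraction of the keys instead of all of them."""
    def __init__(self, servers, replicas=100):
        self.ring = []  # sorted list of (hash, server) points
        for server in servers:
            for i in range(replicas):  # virtual nodes smooth the distribution
                point = self._hash("%s#%d" % (server, i))
                bisect.insort(self.ring, (point, server))

    def _hash(self, key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get(self, key):
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]  # wrap around the ring

ring = HashRing(["redis-1", "redis-2"])
server = ring.get("customers")  # the same key always lands on the same server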
I consider Redis to be a 'managed text file' on steroids that provides fast, concurrent and atomic access for multiple clients. So anything I used to use a text file or an embedded database for, I now use Redis for. For example, getting a real-time combined rolling error log for all our services (which has notoriously been a hard task for us) is now accomplished with only a couple of lines: just prepend the error to a Redis server-side list and then trim the list so only the last 1000 entries are kept, e.g.:
var errors = redis.Lists["combined:errors"];
errors.Insert(0, new Error { Name = ex.GetType().Name, Message = ex.Message, StackTrace = ex.StackTrace });
redis.TrimList(errors, 1000);
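For non-.NET readers: this maps directly onto Redis's LPUSH and LTRIM commands. A minimal equivalent with the redis-py client (key name kept from above):

import json
import redis

r = redis.Redis()  # assumes a local Redis server

def log_error(exc):
    entry = json.dumps({"name": type(exc).__name__, "message": str(exc)})
    r.lpush("combined:errors", entry)   # prepend the newest error
    r.ltrim("combined:errors", 0, 999)  # keep only the latest 1000 entries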
I have no first-hand experience, but I found this blog entry quite interesting.
I find that the effort to map software domain objects (e.g. aSalesOrder, aCustomer...) to a two-dimensional relational database (rows and columns) takes a lot of code to save/update, and then again to instantiate a domain object instance from multiple tables. Not to mention the performance hit of having all those joins and all those disk reads just to view/manipulate a domain object such as a sales order or customer record.
We have switched to object database management systems (ODBMS). They are beyond the capabilities of the NoSQL systems listed. GemStone/S (for Smalltalk) is one such example, and there are other ODBMS solutions that have drivers for many languages. A key developer benefit is that your class hierarchy is automatically your database schema, subclasses and all; just use your object-oriented language to make objects persistent in the database. ODBMS systems provide ACID-level transaction integrity, so they would also work in financial systems.
I switched from MySQL (InnoDB) to Cassandra for an M2M system, which basically stores time series from sensors for each device. Each datum is indexed by (device_id, date) and (device_id, type_of_sensor, date); a rough CQL sketch of this layout follows the lists below. The MySQL version contained 20 million rows.
MySQL:
- Set up in master-master synchronization. A few problems appeared around loss of synchronization; it was stressful, and especially in the beginning it could take hours to fix.
- Insertion time wasn't a problem, but querying required more and more memory as the data grew. The problem is that the indexes are considered as a whole. In my case, only a very thin part of the indexes actually needed to be in memory (only a few percent of the devices were frequently monitored, and only their most recent data).
- It was hard to back up. rsync isn't able to do fast backups on big InnoDB table files.
- It quickly became clear that it wasn't possible to update the schema of the heavy tables, because it took way too much time (hours).
- Importing data took hours (even when indexing was done at the end). The best rescue plan was to always keep a few copies of the database (data files + logs).
- Moving from one hosting company to another was really a big deal. Replication had to be handled very carefully.
Cassandra:
- Even easier to install than MySQL.
- Requires a lot of RAM. A 2GB instance couldn't run it in the first versions; now it can work on a 1GB instance, but it's not ideal (way too many data flushes). Giving it 8GB was enough in our case.
- Once you understand how to organize your data, storing is easy. Querying is a little more complex, but once you get around it, it is really fast (you can't really make mistakes unless you really want to).
- If the previous step was done right, it is and stays super fast.
- It almost seems like the data is organized to be backed up: every new piece of data is added as new files. Personally (though it's not a good practice) I flush data every night and before every shutdown (usually for an upgrade), so that restoring takes less time because there are fewer logs to read. It doesn't create many files, as they are compacted.
- Importing data is fast as hell. And the more hosts you have, the faster it gets. Exporting and importing gigabytes of data isn't a problem anymore.
- Not having a schema is a very interesting thing, because you can make your data evolve to follow your needs. Which might mean having different versions of your data at the same time in the same column family.
- Adding a host was easy (not fast though), but I haven't done it in a multi-datacenter setup.
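For what it's worth, my setup predates CQL, but with today's Cassandra the (device_id, type_of_sensor, date) layout described above might look roughly like this (table and column names are mine):

from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("sensors")  # keyspace name is made up

# Partition by device and sensor type, cluster by time: one series stays
# together on disk, and "the most recent data" is a cheap slice query.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        device_id text,
        sensor_type text,
        ts timestamp,
        value double,
        PRIMARY KEY ((device_id, sensor_type), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

rows = session.execute(
    "SELECT ts, value FROM readings WHERE device_id=%s AND sensor_type=%s LIMIT 100",
    ("dev-42", "temperature"),
)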
Note: I have also used Elasticsearch (document-oriented, based on Lucene), and I think it should be considered a NoSQL database. It is distributed, reliable and often fast (though some complex queries can perform quite badly).
I don't. I would like to use a simple and free key-value store that I can call in-process, but such a thing doesn't exist, AFAIK, on the Windows platform. Now I use SQLite, but I would like to use something like Tokyo Cabinet. BerkeleyDB has license "issues".

However, if you want to use the Windows OS, your choice of NoSQL databases is limited. And there isn't always a C# provider.

I did try MongoDB and it was 40 times faster than SQLite, so maybe I should use it. But I still hope for a simple in-process solution.
I used Redis to store logging messages across machines. It was very easy to implement and very useful. Redis really rocks.
We replaced a Postgres database with a CouchDB document database because not having a fixed schema was a strong advantage for us. Each document has a variable number of indexes used to access that document.
I have used Couchbase in the past, and we encountered rebalancing problems and a host of other issues. Currently I'm using Redis in several production projects, via redislabs.com, a managed service for Redis that takes care of scaling your Redis clusters. I've published a video on object persistence on my blog at http://thomasjaeger.wordpress.com that shows how to use Redis in a provider model and how to store your C# objects in Redis. Take a look.
I would encourage anyone reading this to try Couchbase once more now that 3.0 is out the door. There are over 200 new features for starters. The performance, availability, scalability and easy-management features of Couchbase Server make for an extremely flexible, highly available database. The management UI is built in, and the APIs automatically discover the cluster nodes, so there is no need for a load balancer between the application and the DB. While we don't have a managed service at this time, you can run Couchbase on things like AWS, Red Hat Gears, Cloudera, Rackspace, Docker containers like CloudSoft, and much more. Regarding rebalancing: it depends on what specifically you're referring to, but Couchbase doesn't automatically rebalance after a node failure, by design. An administrator can set up auto-failover for the first node failure; using our APIs you can also gain access to the replica vbuckets for reading prior to making them active, or using the REST API you can enforce a failover from a monitoring tool. This is a special case, but it is possible.
We tend not to rebalance in pretty much any mode unless the node is completely offline and never coming back, or a new node is ready to be balanced in automatically. Here are a couple of guides to help anyone interested in seeing what one of the highest-performing NoSQL databases is all about.
Lastly, I would also encourage you to check out N1QL for distributed querying:
Thanks for reading and let me or others know if you need more help!
Austin
I have used Vertica in the past. It relies on columnar compression to expedite disk reads and lower storage needs, making the most of your hardware. Faster data loads and higher concurrency let you serve analytics data to more users with minimal latency.

Earlier, we were querying an Oracle database holding billions of records, and performance was very sub-optimal. Queries took 8 to 12 seconds to run even after optimizing with SSDs. Hence, we felt the need for a faster, read-optimized, analytics-oriented database. With Vertica clusters behind a lean service layer, we could run APIs with sub-second performance.

Vertica stores data in projections, in a format that optimizes query execution. Similar to materialized views, projections store result sets on disk or SSD rather than computing them each time they are used in a query. Projections provide the following benefits:
- Compress and encode data to reduce storage space.
- Simplify distribution across the database cluster.
- Provide high availability and recovery.
Vertica optimizes the database by distributing data across the cluster using segmentation:

- Segmentation places a portion of the data on each node.
- It evenly distributes the data across all nodes, so each node performs a piece of the querying process.
- A query runs on the cluster, and every node receives the query plan.
- The query results are aggregated and used to create the output.
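To make that concrete, here is a sketch of defining a segmented projection (the table, columns and connection details are invented; executed here via the vertica-python client):

import vertica_python  # Vertica's official Python client

conn_info = {"host": "127.0.0.1", "port": 5433, "user": "dbadmin",
             "password": "", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # A projection is a sorted, encoded copy of selected columns;
    # SEGMENTED BY HASH spreads its rows evenly across all cluster nodes.
    cur.execute("""
        CREATE PROJECTION events_by_user AS
        SELECT user_id, event_time, event_type
        FROM events
        ORDER BY user_id, event_time
        SEGMENTED BY HASH(user_id) ALL NODES
    """)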
For more, please refer to Vertica documentation @ https://www.vertica.com/knowledgebase/