- Best LLM GPU requirements (Reddit)

What about a 13th-gen i7-13700F?

My Quest is saying my computer doesn't meet the minimum requirements and I can't understand why.

Thank you for your recommendations!

The "minimum" is one GPU that completely fits the size and quant of the model you are serving.

I have recently been using Copilot from Bing and I must say, it is quite good.

EDIT: I have 4 GB of GPU RAM and, in addition to that, 16 GB of ordinary DDR3 RAM.

"Best" is so conditionally subjective.

I didn't see any posts talking about or comparing how different types/sizes of LLM influence the performance of the whole RAG system.

You get DDR5 speeds plus the AVX-512 instruction set, which is now absent on the Intel platform.

I've used both A1111 and ComfyUI and it's been working for months now.

Here is a 4-bit GPTQ version that will work with ExLlama, text-generation-webui etc.

Everything runs locally and is accelerated with the native GPU on the phone.

Which among these would work smoothly without heating issues?

I'd like to speed things up and make it as affordable as possible.

Technology is changing fast, but I see most folks being productive with 8B models fully offloaded to GPU.

But for my needs, the free ones are good.

My question is: what is the best quantized (or full) model that can run on Colab's resources without being too slow? I mean at least 2 tokens per second.

If you have concerns about the legality of any sexual activity, it's best to consult the

Performance-wise, this option is robust, and it can scale up to 4 or more cards (I think the maximum for NVLink 1 is six cards, from memory), creating a substantial 64GB GPU.
92 GB. So using 2 GPUs with 24GB each (or 1 GPU with 48GB), we could offload all the layers to the

Coding really means "software engineering", which involves formal documentation, lots of business requirements, lots of functional requirements, architecture, strategy, best practices, multi-platform considerations, code maintenance considerations, planning for FUTURE refactoring, and so on.

com/r/Oobabooga/comments/126dejd/comment/jebvlt0/

If you can find a

Hi everyone, I'm upgrading my setup to train a local LLM.

My GPU was pretty much busy for months with AI art, but now that I've bought a better one, I have a 12GB GPU (RTX with CUDA cores) sitting ready to use in a computer built mostly from recycled spare parts.

I agree that this is your best solution, or just rent a good GPU online and run a 70B model for like $0.

With the newest drivers on Windows you cannot use more than 19-something GB of VRAM, or everything will just freeze.

This VRAM calculator helps you figure out the required memory to run an LLM, given

LLM Startup Embraces AMD GPUs, Says ROCm Has 'Parity' With Nvidia's CUDA Platform | CRN.

Give me the Ubiquiti of local LLM infrastructure.

Also, bonus features of a GPU: Stable Diffusion, LLM LoRA training.

In general, GPU inferencing is preferred if you have the VRAM, as it is 100x faster than CPU.

sampling: temp = 0.

If you step up the budget a bit, a

That's the FP16 version.

Hi all, here's a buying guide that I made after getting multiple questions from my network on where to start.

My understanding is that we can reduce system RAM use if we offload LLM layers onto the GPU memory.

LLM was barely coherent.

Adequate VRAM to support the sizeable parameters of LLMs in FP16 and FP32, without quantization.

GPT4-X-Vicuna-13B q4_0, and you could maybe offload like 10 layers (40 is the whole model) to the GPU using the -ngl argument in llama.

Q4_K_M.
Currently I am running a merge of several 34B 200K models, but I am

I posted my latest LLM Comparison/Test just yesterday, but here's another (shorter) comparison/benchmark I did while working on that: testing different formats and quantization levels.

I'd suggest koboldcpp and OpenBLAS with llama-3-8B instead: quick enough to get timely responses, competent enough that you'll know when it is or isn't leading you astray.

I've added some models to the list and expanded the first part, sorted results into tables, and

I have a setup with 1x P100 GPU and 2x E5-2667 CPUs and I am getting around 24 to 32 tokens/sec on ExLlama. You can easily fit 13B and 15B GPTQ models on the GPU, and there is a special adapter to convert from the GPU power cable to the CPU cable needed.

Many folks frequently don't use the best available model because it's not the best for their requirements / preferences (e.

5 Gbps PCIE 4.

Preferably Nvidia cards, though AMD cards are far cheaper for higher VRAM, which is always best.

What are now the best LLMs runnable in 12GB VRAM for:
- programming (mostly Python)
- chat
Thanks!

GPU: MSI RX7900-XTX Gaming Trio Classic (24GB VRAM)
RAM: Corsair Vengeance 32GBx2 5200MHz
I think the setup is one of the best value for money, but only if it works for GenAI :(
Exploration: After spending nearly 10 days with my setup, these are my observations: AMD has a lot to do in terms of catching up to Nvidia's software usability.

800000, top_k = 40, top_p = 0.

7B GPTQ or EXL2 (from 4bpw to 5bpw).

0 x4 Internal Solid State Drive (SSD) WDS200T2X0E

I'm looking for the best uncensored local LLMs for creative story writing.

The understanding of dolphin-2.

If you want to install a second GPU, even a PCIe x1 slot (with a riser to x16) is sufficient in principle.

I am excited about Phi-2, but some of the posts here indicate it is slow for some reason despite being a small model.
No GPUs yet (my non-LLM workloads can't take advantage of GPU acceleration) but I'll be buying a few refurbs eventually.

Do not use GGUF for pure GPU inferencing, as that is much slower than the other methods.

However, most of the models I found seem to target less than 12GB of VRAM, but I have an RTX 3090 with 24GB of VRAM.

I managed to push it to 5 tok/s by allowing 15 logical cores.

though that was indeed a

The base llama one is good for normal (official) stuff

Euryale-1.

GPU: MSI Suprim GeForce RTX 4090 24GB GDDR6X PCI Express 4.

People serve lots of users through Kobold Horde using only single and dual GPU configurations, so this isn't something you'll need 10s of 1000s for.

6 GHz, 4c/8t), Nvidia GeForce GT 730 GPU (2GB VRAM), and 32GB DDR3 RAM (1600MHz) be enough to run the 30B llama model, and at a decent speed?

Specifically, GPU isn't used in llama.

I can run the 65B 4-bit quantized model of LLaMA right now, but LoRAs / open chat models are limited.

My goal was to find out which format and quant to focus on.

miqu 70B q4k_s is currently the best, split between CPU/GPU, if you can tolerate a very slow generation speed.

Here is my benchmark-backed list of 6 graphics cards I found to be the

I am a total newbie to the LLM space.

Oobabooga WebUI, koboldcpp, and in fact any other software made for easily accessible local LLM text generation and private chatting with AI models have similar best-case scenarios when it comes to the top consumer GPUs you can use with them to maximize performance.

I'm mostly looking for ones that can write good dialogue and descriptions for fictional stories.

I'm planning to build a GPU PC specifically for working with large language models (LLMs), not for gaming.

Firstly, would an Intel Core i7 4790 CPU (3.
I've done some consulting on this and I found the best way is to break it down piece by piece, and hopefully offer some solutions and alternatives for the businesses.

3-L2-70B is good for general RP/ERP stuff, really good at staying in character

Spicyboros2.

Not the fastest thing in the world running local (only about 5 tps) but the responses and

What I managed so far: Found instructions to make 70B run on VRAM only with a 2.

ZOTAC Gaming GeForce RTX™ 3090 Trinity OC 24GB GDDR6X 384-bit 19.

I need something lightweight that can run on my machine, so maybe 3B, 7B or 13B.

Reasonable graphics card for LLM AND gaming.

And a personal conclusion for you: in the LLM world I don't think we, the personal-usage project makers, are at an advantage even when buying a medium-performance graphics card, even a second-hand one, because the prices per 1000 tokens (look at OpenAI, where ChatGPT is, actually, the best, or look at Claude 2, which is good enough, and the prices

Typically they don't exceed the performance of a good GPU.

CPU: Since the GPU will be the highest priority for LLM inference, how crucial is the CPU? I'm considering an Intel socket 1700 for future upgradability.

2 2280 2TB PCI-Express 4.

7 GHz, ~$130) in terms of impacting LLM performance?

Also check how much VRAM your graphics card has, some programs like llama.

cpp, nanoGPT, FAISS, and langchain installed, also a few models locally resident with several others available remotely via the GlusterFS mountpoint.

On the PC side, get any laptop with a mobile Nvidia 3xxx or 4xxx GPU, with the most GPU VRAM that you can afford.

It currently uses CPU only, and I will try to get an update out today that sets it to GPU. If you feed model_args={'gpu': True} into the ModelPack it will run on GPU, assuming you have a CUDA-compatible machine.

So I took the best 70B according to my previous tests, and re-tested it with various formats and quants.
As a bonus, Linux by itself easily gives you something like a 10-30% performance boost for LLMs, and on top of that, running headless Linux frees up the entire VRAM so you can have all of it for your LLM, which is impossible in Windows because Windows itself reserves part of the VRAM just to render the desktop.

This AMD Radeon RX 480 8GB is still a value king.

gguf into memory without

Maybe it is a bit late, but the best way to go with CPU inference on x86 is the Ryzen 7000 series.

cpp, so are the CPU and RAM enough? I currently have 16GB, so I want to know if going to 32GB would be all I need.

For my 30B model with 1B tokens, I want to complete the training within 24 hours.

13s gen t: 15.

I have kept these tests unchanged for as long as possible to enable direct comparisons and establish a consistent ranking for all models tested, but I'm taking the release of Llama 3 as an opportunity to conclude this test series as planned.

I want to experiment with medium-sized models (7B/13B) but my GPU is old and has only 2GB VRAM.

That's an approximate list.

4x A6000 ADA v.

Also, for this Q4 version I found that offloading 13 layers to the GPU is optimal.

41s speed: 5.

cpp can put all or some of that data into the GPU if CUDA is working.

From what I have read here, with a 7950X you will get double the t/s compared to a 5950X.

So not ones that are just good at roleplaying, unless that helps with dialogue.

cpp, so it also obeys most of llama.

When I

For a GPU, whether 3090 or 4090, you need one free PCIe slot (electrical), which you will probably have anyway due to the absence of your current GPU, but the 3090/4090 physically takes up the space of three slots.

They do exceed the performance of the GPUs in non-gaming oriented systems and their power consumption for a given level of performance is probably 5-10x better than a

What's the best LLM to run on a Raspberry Pi 4B with 4 or 8GB?
I am trying to find the best model to run. It needs to be controllable via Python, it should run locally (I don't want it to be always connected to the internet), it should run at at least 1 token per second, and it should be pretty good.

I'm currently in the market to build my first PC in over a decade.

You could either run some smaller models on your GPU at pretty fast speed, or bigger models with CPU+GPU at significantly lower speed but higher quality.

Or if you are just playing around, you just write/search for a post on Reddit (or various LLM-related Discords) asking for the best model for your task :D

I made this post as an attempt to collect best practices and ideas.

That's always a good option probably, but I try to avoid using OpenAI altogether.

Stable Diffusion (AI drawing) and local LLMs (ChatGPT alternative) fall under this category.

3-$0.

Very few companies in the world

cpp parameters including num_gpu, which defines how many LLM layers will be offloaded to GPU.

2 is capable of generating content that society might frown upon, can and will be happy to produce some crazy stuff, especially when it

After using GPT-4 for quite some time, I recently started to run LLMs locally to see what's new.

io, they're all principally focused on getting an IPEX-LLM environment prepared for running on GPU rather than CPU.

Does anyone here have experience building or using external GPU servers for LLM training and inference? Someone please show me the light to a "prosumer" solution.

3B model requires 6GB of memory and 6GB of allocated disk storage to store the model (weights).

CPU is nice with the easily expandable RAM and all, but you'll lose out on a lot of speed if you don't offload at least a couple of layers to a fast GPU.

If you only care about local AI models, with no regard for gaming performance, dual 3090s will be way better, as LLM front ends like Oobabooga support multi-GPU VRAM loading.
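The "3B model requires 6GB" figure above is what you get from two bytes per weight (FP16). A rough sketch of that arithmetic, assuming weights only (no context/KV cache, no runtime overhead) and nominal bit widths; the function name `weight_memory_gb` is made up for illustration and real quant formats usually come out a bit larger than the nominal number:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same nominal bit width.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal examples: FP16 = 16 bits, 8-bit and 4-bit quants.
    for label, params, bits in [
        ("3B at FP16", 3, 16),
        ("13B at 8-bit", 13, 8),
        ("30B at 4-bit", 30, 4),
    ]:
        print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB for weights")
```

Whatever this gives you, leave some headroom on the card for context and the desktop before deciding a model "fits".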
5 bpw that run fast but the perplexity was unbearable.

I used Llama-2 as the guideline for VRAM

Here's a report from a user running LLaMA 30Bq4 at good speeds on his P40: https://www.

For ollama you just need to add the following parameter to the Modelfile: PARAMETER num_gpu XX

Finally! After a lot of hard work, here it is, my latest (and biggest, considering model sizes) LLM Comparison/Test: This is the long-awaited follow-up to and second part of my previous LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs.

EDIT: As a side note, power draw is very nice, around 55 to 65 watts on the card currently running inference according to NVTOP.

cpp? I tried running this on my machine (which, admittedly, has a 12700K and 3080 Ti) with 10 layers offloaded and only 2 threads to try and get something similar-ish to your setup, and it peaked at 4.

6-mistral-7b is impressive! It feels like GPT-3 level understanding, although the long-term memory aspect is not as good.

The implementation is available online in our Intel®-Extension-for-Pytorch repository.

and SD works using my GPU on Ubuntu as well.

That is changing fast.

Apparently there are some issues with multi-GPU AMD setups that don't run all on matching, direct GPU<->CPU PCIe slots - source.

The AI takes approximately 5-7 seconds to respond in-game.

The key seems to be good training data with simple examples that teach the desired skills (no confusing Reddit posts!).

Here's my latest, and maybe last, Model Comparison/Test, at least in its current form.

However, it's worth noting that laws regarding sexual activities can differ, and there may be specific legal restrictions or cultural norms in certain places.

cpp (with GPU offloading.

For an extreme example, how would a high-end i9-14900KF (24 cores, up to 6 GHz, ~$550) compare to a low-end i3-14100 (4 cores, up to 4.
I've got my own little project in the works, currently doing very fast 2048-token inference

The TinyStories models aren't that smart, but they write coherent little-kid-level stories and show some reasoning ability with only a few Transformer layers and ≤ 0.035B parameters.

Whether a 7B model is "good" in the first place is relative to your expectations.

Training is ≤ 30 hours on a single GPU.

We will also update this for Mac M-series laptops later.

- GPU: Nvidia GeForce RTX 4070 - 12GB VRAM, 504.

0 Video Card RTX 4090 SUPRIM LIQUID X 24G
SSD: WD_BLACK SN850X NVMe M.

Bang for buck, 2x 3090s is the best setup.

Only looking for a laptop for portability

exe to run the LLM, I chose 16 threads since the Ryzen 9 7950X has 16 physical cores, and selected my only GPU to offload my 40 layers to.

- Frequent GPU crashes and driver issues (backed by 3 comments)
- Defective or damaged products (backed by 16 comments)
- Compatibility issues with BIOS and motherboards (backed by 2 comments)
According to Reddit, AMD is considered a reputable brand.

You can run 30B 4-bit on a high-end GPU with 24GB VRAM, or with a good (but still consumer grade) CPU and 32GB of RAM at acceptable speed.

My goal is to achieve decent inference speed and handle popular models like Llama3 medium and Phi3, with the possibility of expansion.

It seems that most people are using ChatGPT and GPT-4.

At least as of right now, I think what models people are actually using while coding is often more informative.
I need some advice about what hardware to buy in order to build an ML / DL workstation for private experiments at home. I intend to play with different LLMs, train some, and try to tweak the way the models are built and understand what impacts training speed, so I will have to train, study the results, tweak the model / data / algorithms and train again.

MLC LLM for Android is a solution that allows large language models to be deployed natively on Android devices, plus a productive framework for everyone to further optimize model performance for their use cases.

With dual 3090s, 48 GB of VRAM opens the door to 70B models running entirely in VRAM.

The task is to take in a word and then find the most similar word on a fixed list of 500 words given in the prompt (where there is also a set of rules for what "similar" means).

I am starting to like it a lot.

So I was wondering if there is an LLM with more parameters that could be a really good match for my GPU.

task(s), language(s), latency, throughput, costs, hardware, etc)

Hybrid GPU+CPU inference is very good.

The Best GPUs for Deep Learning in 2023 — An In-depth Analysis (timdettmers.

I enabled "keep entire model in RAM", "Apple Metal (GPU)" and set the context length to 15000.

It was a good post.

I'm looking for an LLM that I can run locally and connect to VS Code. Basically I want to find out if there is a good LLM for coding that supports VS Code. I don't have a good internet connection, so Copilot or another online solution seems slow, and I thought maybe there is an offline version of something like Copilot that I can run

Yeah, exactly.

12x 70B, 120B, ChatGPT/GPT-4.

So, the results from LM Studio: time to first token: 10.

using the install process detailed in readingthedocs.

Still anxiously anticipating your decision about whether or not to share those quantized models.

It's about 14 GB and it won't work on a 10 GB GPU.

How many GPUs do I require?
Well, now you can use a simple calculator to estimate or make an educated guess.

8 per hour while writing, that's what I usually do.

To maintain good speeds, I always try to keep chat context at around 2000 tokens max, which is the sweet spot between memory and speed. Even with just 2K context, my 13B chatbot is able to remember pretty much everything, thanks to a vector database. It's kinda fuzzy, but I applied all kinds of crazy tricks to make it work every

Which GPU server spec is the best for pretraining RoBERTa-size LLMs with a $50K budget, 4x RTX A6000 v.

Does anyone have suggestions for which LLM(s) would work best with such a GPU rig? Specifically, what would work well for a rig with seven GPUs with 8GB VRAM each?

Hey folks, I was planning to get a MacBook Pro M2 for everyday use and wanted to make the best choice, considering that I'll want to run some LLMs locally as a helper for coding and general use.

I think I use 10~11 GB for 13B models like vicuna or gptxalpaca.

For GPU-only you could choose these model families: Mistral-7B GPTQ/EXL2 or Solar-10.

One of those T7910 with the E5-2660v3 is set up for LLM work -- it has llama.

I used to have ChatGPT-4 but I cancelled my subscription.

Right at this moment, I don't believe a lot of businesses should be

A MacBook Air with 16 GB RAM, at minimum.

On Q8, base RAM required is in line with model size; a 13B would be 13GB, for example.

8GB of VRAM could handle quite a few things; for $350 it is definitely a good place to start.

So theoretically the computer can have less system memory than GPU memory? For example, referring to TheBloke's lzlv_70B-GGUF provided Max RAM required: Q4_K_M = 43.

High memory bandwidth capable of efficient data processing for both dense

Subreddit to discuss Llama, the large language model created by Meta AI.
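For the "how many GPUs" style of question above, the educated guess is mostly a ceiling division of required memory by per-card VRAM. A minimal sketch, assuming the model splits evenly across cards and that some fraction of each card stays free for context and overhead; `gpus_needed` and `usable_fraction` are illustrative names, and the example numbers are hypothetical, not measurements:

```python
import math


def gpus_needed(required_gb: float, vram_per_gpu_gb: float,
                usable_fraction: float = 1.0) -> int:
    """Smallest number of identical GPUs whose pooled VRAM covers required_gb.

    Assumption: the model can be split evenly across cards, and only
    usable_fraction of each card's VRAM is available for it.
    """
    if required_gb <= 0 or vram_per_gpu_gb <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("inputs must be positive and usable_fraction in (0, 1]")
    usable_gb = vram_per_gpu_gb * usable_fraction
    return math.ceil(required_gb / usable_gb)


if __name__ == "__main__":
    # Hypothetical 40 GB requirement on three different card sizes.
    for vram in (48, 24, 8):
        print(f"{vram} GB cards: {gpus_needed(40, vram)} needed")
```

In practice splitting over many small cards costs speed, so a result like "five 8GB cards" means it might load, not that it will run well.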
Most consumer GPU cards top out at 24 GB VRAM, but that's plenty to run any 7B, 8B or 13B model.

Though I put GPU speed fairly low because I've seen a lot of reports of fast GPUs that are held back by slow CPUs.

Also, I think you can probably find the VRAM necessary for a model somewhere on Google or Reddit.

I've been using codellama 70b for speeding up development for personal projects and have been having a fun time with my Ryzen 3900X with 128GB and no GPU acceleration.

Doubt it will add up to more than a few hundred before the next generation of GPUs is released.

For running models like GPT or BERT locally, you need GPUs with high VRAM capacity and a large number of CUDA cores.

T^T In any case, I'm very happy with Llama-3-70b-Uncensored-Lumi-Tess-gradient, but running it is a challenge.

Honestly, the best CPU models are nonexistent, or you'll have to wait for them to be released eventually.

Keeping that in mind, you can fully load a Q_4_M 34B model like synthia-34b-v1.

Yes, you will have to wait for 30 seconds, sometimes a minute.

If you are buying new equipment, then don't build a PC without a big graphics card.

The complete model can fit in VRAM, which performs calculations at the highest speed.

To lower latency, we simplify the LLM decoder layer structure to reduce the data movement overhead.

00 tok/s stop reason: completed gpu layers: 13 cpu threads: 15 mlock: true token count: 293/4096 nous-capybara-34b

Sorry if this is a dumb question, but I loaded this model into Kobold and said "Hi" and had a pretty decent and very fast conversation. It was loading as fast as I could read and was a sensible conversation where the things it said in the first reply continued through the whole story.

Through Poe, I access different LLMs, like Gemini, Claude and Llama, and I use the one that gives the best output.
6B to 120B (StableLM, DiscoLM German 7B, Mixtral 2x7B, Beyonder, Laserxtral, MegaDolphin)

- Good for casual gaming and programs (backed by 1 comment)
Users disliked:
- Frequent GPU crashes and driver issues (backed by 3 comments)
- Defective or damaged products (backed by 16 comments)
- Compatibility issues with BIOS and motherboards (backed by 2 comments)
According to Reddit, AMD is considered a reputable brand.

Alternatively, here is the GGML version which you could use with llama.

I'm particularly interested in using Phi3 for coding, given its impressive benchmark results and performance

The right GPU for your large language model (LLM) workloads is critical for achieving high performance and efficiency.

2x A100 80GB. We won't work on very large LLMs, and we may not even try the T5 model.

950000, repeat_last_n = 2048, repeat_penalty = 1.

cpp directly, which I also used to run.

I remember that post.

For smaller models, a single high-end GPU like the RTX 4080 can suffice,

When launching koboldcpp.

I ran a 70B Llama 2 Q5 on an M1 Ultra (48-core GPU, 128 GB RAM) using LM Studio and get between 6 and 7 t/s.

Can these models work on clusters of multi-GPU machines and 2GB GPUs? For example, a watch with 8 pcie 8x slot and 8 gpu with

Seen two P100s get 30 t/s using exllama2, but couldn't get it to work on more than one card.

So I'll probably be using Google Colab's free GPU, which is an Nvidia T4 with around 15 GB of VRAM.

The model is around 15 GB with mixed precision, but my current hardware (old AMD CPU + GTX 1650 4 GB + GT

Currently, I'm using Codegeex 9B for chat and Codeqwen-base for autocomplete.

It uses roughly 50GB of RAM.

When I was assembling a budget workstation for personal use, I had a pretty good idea of the minimal requirements for the hardware.

Currently, my GPU Offload is set at 20 layers in LM Studio model settings.

As the title says, I am trying to get a decent model for coding/fine-tuning on a lowly Nvidia 1650 card.
Its most popular types of products are: Graphics Cards (#8 of 15 brands on Reddit)

The main contributions of this paper include: We propose an efficient LLM inference solution and implement it on Intel® GPU.

01/18: Apparently this is a very difficult problem to solve from an engineering perspective.

GPU Requirements: Depending on the specific Falcon model variant, GPUs with at least 12GB to 24GB of VRAM are recommended.

use GPT4 to evaluate output of llama.

Apple Silicon Macs have fast RAM with lots of bandwidth and an integrated GPU that beats most low-end discrete GPUs.

Unless 16GB of RAM isn't enough, though the Meta website says a minimum of 8GB is required.

2 GB/s bandwidth

- LLM - assume that the base LLM stores weights in Float16 format.

0 Gaming Graphics Card, IceStorm 2.

The VRAM capacity of your GPU must be large enough to accommodate the file sizes of the models you want to run.

Budget: Around $1,500
Requirements: GPU capable of handling LLMs efficiently.

2GB of VRAM usage (with a bunch of stuff open in

[D] Calculate GPU Requirements for Your LLM Training

The process of offloading only a specific number of layers to the GPU? As I said, ollama is built on top of llama.

Considering current prices, you'd spend around $1500 USD for

My GPU is an Nvidia RTX 3060 with 12GB.

This practical guide will walk you through evaluating and selecting the best GPUs for LLM.

It's probably by far the best bet for your card, other than using llama.

The best GPU models are those with high VRAM (12GB or up). I'm struggling on an 8GB VRAM 3070 Ti, for instance.

Yes, oobabooga can load models split across both the CPU and GPU.
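For the partial-offload case mentioned above (only some layers on the GPU, the rest on CPU), a first guess for the layer count can be made from the model file size. This is only a sketch under loud assumptions: all layers are treated as equal in size, `reserve_gb` is a guessed allowance for context and desktop use, and `layers_that_fit` is a made-up helper, not part of llama.cpp or ollama. Treat the result as a starting value for -ngl / num_gpu and adjust by trial:

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.0) -> int:
    """Estimate how many layers of a model fit into a VRAM budget.

    Assumptions: layers are equal in size (model_gb / n_layers each) and
    reserve_gb of VRAM is kept free for context and other programs.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical 10 GB model file with 40 layers on cards of various sizes.
    for vram in (4, 8, 24):
        n = layers_that_fit(10.0, 40, vram)
        print(f"{vram} GB VRAM: try offloading about {n} of 40 layers")
```

If the estimate equals the full layer count, the whole model should sit in VRAM, which is where the big speedups people report come from.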
Even the 7900 XTX is pretty good from my limited testing. I hope AMD manages to tap into some of that ML performance and

I want to run a 70B LLM locally at more than 1 T/s.

- the model name
- the quant type (GGUF and EXL2 for now, GPTQ later)
- the quant size
- the context size
- cache type
---> Not my work, all the glory belongs to NyxKrage <---

100000 generate: n_ctx = 2048, n_batch = 512, n_predict = 65536, n_keep = 0

I was not able to read all of it, but the text (a story) contains the protagonist's name all the way to the end, the characters develop and it seems the story could be

I will most likely be choosing a new operating system, but first I was recommended (by the previous owner) to choose the most relevant LLM that would be optimized for this machine.

For $350 I think you could buy a used RTX 3070.

0 Advanced Cooling, Spectra 2.

A 4090 won't help if it can't get data fast enough.

Most LLMs struggle to pick one of the 500 words, most often selecting random words not on the list no matter how emphatic the prompt.

🐺🐦⬛ LLM Comparison/Test: 6 new models from 1.

It's a brand new gaming PC with a 4060 Ti GPU.

NVIDIA A100 Tensor Core GPU: A powerhouse for LLMs with 40 GB or more VRAM,

Here is my benchmark-backed list of 6 graphics cards I found to be the best for working with various open source large language models locally on your PC.

Generally the bottlenecks you'll encounter are roughly in the order of VRAM, system RAM, CPU speed, GPU speed, operating system limitations, disk size/speed.

Otherwise 20B-34B with 3-5bpw exl2 quantizations is best.

I have a 3090 with 24GB VRAM and 64GB RAM on the system.
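Context size and cache type show up in the calculator inputs above because the KV cache takes VRAM on top of the weights. A rough sketch of the usual estimate (keys plus values, per layer, per KV head, per token), assuming a plain transformer layout; `kv_cache_bytes` is an illustrative name and the example shape (32 layers, 32 KV heads, head size 128) is an assumption resembling a 7B Llama-style model, not something read from a model file:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence.

    The factor 2 counts keys and values; bytes_per_value=2 assumes an
    FP16 cache (a quantized cache type would use fewer bytes).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Assumed 7B-style shape at two context lengths, FP16 cache.
    for ctx in (2048, 4096):
        size = kv_cache_bytes(32, 32, 128, ctx)
        print(f"context {ctx}: ~{size / 2**30:.2f} GiB of KV cache")
```

Doubling the context doubles this number, which is one reason long-context settings can push a model that "fits" back out of VRAM.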